Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Merrick, Luke

arXiv.org Artificial Intelligence

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
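The abstract's core recipe — embed each training example, run k-means on the embeddings, then build minibatches whose examples all come from one cluster — can be sketched in a few lines. This is only an illustration of the general idea, not the paper's implementation; the function names (`kmeans`, `cluster_stratified_batches`) and the plain Lloyd's-algorithm k-means are assumptions made for the sketch.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign each point to its nearest centroid,
    # then recompute centroids as cluster means. X is (n_examples, dim).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared Euclidean distances from every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return labels

def cluster_stratified_batches(embeddings, batch_size, k, seed=0):
    # Group examples by the k-means cluster of their embedding, then emit
    # minibatches drawn from a single cluster at a time -- the single-source
    # batching idea pushed down to sub-source (semantic cluster) granularity.
    labels = kmeans(embeddings, k, seed=seed)
    rng = np.random.default_rng(seed)
    batches = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batches.append(idx[start:start + batch_size])
    # Shuffle the batch order so training does not walk clusters sequentially.
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]
```

Within each such batch, in-batch negatives come from the same semantic cluster, which is what makes them harder than negatives drawn from a mixed pool.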


Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

Belkin, Mikhail, Niyogi, Partha

Neural Information Processing Systems

Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. In many areas of artificial intelligence, information retrieval and data mining, one is often confronted with intrinsically low dimensional data lying in a very high dimensional space. For example, grayscale n x n images of a fixed object taken with a moving camera yield data points in R^(n^2). However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera - in fact the space has the natural structure of a manifold embedded in R^(n^2).
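The Laplacian eigenmaps algorithm described above can be sketched compactly: build a nearest-neighbor graph over the data, form the graph Laplacian L = D - W, and embed each point using the eigenvectors of L with the smallest nonzero eigenvalues. The sketch below is a minimal numpy-only illustration with simple 0/1 edge weights (the paper also discusses heat-kernel weights); the function name and parameter choices are assumptions made for the example.

```python
import numpy as np

def laplacian_eigenmap(X, n_neighbors=5, dim=2):
    # X is (n_points, ambient_dim); returns an (n_points, dim) embedding.
    n = len(X)
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Symmetric k-nearest-neighbor adjacency with 0/1 weights.
    W = np.zeros((n, n))
    for i in range(n):
        # argsort puts the point itself first (distance 0); skip it.
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:
            W[i, j] = W[j, i] = 1.0
    # Unnormalized graph Laplacian: degree matrix minus adjacency.
    L = np.diag(W.sum(1)) - W
    # eigh returns eigenvalues in ascending order; the first eigenvector is
    # the trivial constant one, so the embedding uses the next `dim` of them.
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]
```

Because the eigenvectors vary slowly over edges of the neighborhood graph, nearby points in the original space land near each other in the embedding, which is the locality-preserving property the abstract refers to.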